Project: Movies Dataset Analysis

Table of Contents

Introduction

What can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry. Film aficionados might have different interests. Can we predict which films will be highly rated, whether or not they are a commercial success?

This is a great place to start digging in to those questions, with data on the plot, cast, crew, budget, and revenues of several thousand films.

In [1]:
# importing sweetviz
#import sys
#!{sys.executable} -m pip install sweetviz;

# importing plotly
import sys
!{sys.executable} -m pip install plotly_express
Requirement already satisfied: plotly_express in /opt/conda/lib/python3.6/site-packages (0.4.1)
Requirement already satisfied: scipy>=0.18 in /opt/conda/lib/python3.6/site-packages (from plotly_express) (1.2.1)
Requirement already satisfied: plotly>=4.1.0 in /opt/conda/lib/python3.6/site-packages (from plotly_express) (4.14.3)
Requirement already satisfied: pandas>=0.20.0 in /opt/conda/lib/python3.6/site-packages (from plotly_express) (0.23.3)
Requirement already satisfied: patsy>=0.5 in /opt/conda/lib/python3.6/site-packages (from plotly_express) (0.5.1)
Requirement already satisfied: statsmodels>=0.9.0 in /opt/conda/lib/python3.6/site-packages (from plotly_express) (0.12.2)
Requirement already satisfied: numpy>=1.11 in /opt/conda/lib/python3.6/site-packages (from plotly_express) (1.12.1)
Requirement already satisfied: retrying>=1.3.3 in /opt/conda/lib/python3.6/site-packages (from plotly>=4.1.0->plotly_express) (1.3.3)
Requirement already satisfied: six in /opt/conda/lib/python3.6/site-packages (from plotly>=4.1.0->plotly_express) (1.11.0)
Requirement already satisfied: python-dateutil>=2.5.0 in /opt/conda/lib/python3.6/site-packages (from pandas>=0.20.0->plotly_express) (2.6.1)
Requirement already satisfied: pytz>=2011k in /opt/conda/lib/python3.6/site-packages (from pandas>=0.20.0->plotly_express) (2017.3)
In [2]:
import pandas as pd
import numpy as np
import plotly.express as px
from matplotlib import pyplot as plt
%matplotlib inline
import plotly.graph_objects as go
#import sweetviz as sv

Data Wrangling

General Properties

In [3]:
df_orig = pd.read_csv("tmdb_movies.csv")
df_movies = df_orig.copy()
df_movies.set_index('id',inplace=True)
In [4]:
df_orig.head()
Out[4]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. ... Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home ... Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09

5 rows × 21 columns

In [5]:
df_orig.shape
Out[5]:
(10866, 21)
In [6]:
# checking columns data types and missing values
df_orig.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              10866 non-null float64
revenue_adj             10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
In [7]:
# showing statistics
df_orig.describe()
Out[7]:
id popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 10866.000000 10866.000000 1.086600e+04 1.086600e+04 10866.000000 10866.000000 10866.000000 10866.000000 1.086600e+04 1.086600e+04
mean 66064.177434 0.646441 1.462570e+07 3.982332e+07 102.070863 217.389748 5.974922 2001.322658 1.755104e+07 5.136436e+07
std 92130.136561 1.000185 3.091321e+07 1.170035e+08 31.381405 575.619058 0.935142 12.812941 3.430616e+07 1.446325e+08
min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000 10.000000 1.500000 1960.000000 0.000000e+00 0.000000e+00
25% 10596.250000 0.207583 0.000000e+00 0.000000e+00 90.000000 17.000000 5.400000 1995.000000 0.000000e+00 0.000000e+00
50% 20669.000000 0.383856 0.000000e+00 0.000000e+00 99.000000 38.000000 6.000000 2006.000000 0.000000e+00 0.000000e+00
75% 75610.000000 0.713817 1.500000e+07 2.400000e+07 111.000000 145.750000 6.600000 2011.000000 2.085325e+07 3.369710e+07
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000 4.250000e+08 2.827124e+09

Insight

If the SD is more then ONE THIRD OF Arithmetic mean then it is considered high; therefore both popularity, budget_adj, revenue_adj and vote_count contains significant outliers.

This can be displayed in the following histograms

In [8]:
fig = px.histogram(df_orig, x="popularity")
fig.show()
In [9]:
fig = px.histogram(df_orig, x="budget_adj")
fig.show()
In [10]:
fig = px.histogram(df_orig, x="revenue_adj")
fig.show()
In [11]:
fig = px.histogram(df_orig, x="vote_count")
fig.show()
In [12]:
#Here I was trying sweetviz but didn't like it
#report = sv.analyze(df_orig, "revenue")
#report.show_html()

Data Cleaning

Removing unneeded columns; we won't need these columns to answer our questions:

id
imdb_id
original_title
homepage tagline
overview
budget (we will use budget_adj for to account for inflatation over time)
revenue (we will use revenue_adj for to account for inflatation over time)
release_date
runtime
keywords
vote_count

In [13]:
# removing unneeded columns
df_movies.drop(['imdb_id','homepage','tagline','overview','budget','revenue','release_date','runtime','keywords','vote_count'],axis=1,inplace=True)
In [14]:
df_movies.shape
Out[14]:
(10866, 10)
In [15]:
df_movies.head()
Out[15]:
popularity original_title cast director genres production_companies vote_average release_year budget_adj revenue_adj
id
135397 32.985763 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6.5 2015 1.379999e+08 1.392446e+09
76341 28.419936 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... George Miller Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 7.1 2015 1.379999e+08 3.481613e+08
262500 13.112507 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... Robert Schwentke Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 6.3 2015 1.012000e+08 2.716190e+08
140607 11.173104 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... J.J. Abrams Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 7.5 2015 1.839999e+08 1.902723e+09
168259 9.335014 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... James Wan Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 7.3 2015 1.747999e+08 1.385749e+09

Dealing with null values; columns with Missing Values:

cast
director
keywords
genres
production_companies

In [16]:
# checking for any null values
df_null = df_movies[df_movies.isna().any(axis=1)]
df_null.shape
Out[16]:
(1093, 10)
In [17]:
#removing null values
df_movies.dropna(axis=0,how="any",inplace=True)

Excluding outliers into a separate df; columns with significant outliers

popularity
budget_adj
revenue_adj
vote_count

Using IQR, we can follow the below approach to exclude the outliers into a separate df:

1 Calculate the first and third quartile (Q1 and Q3).
2 Further, evaluate the interquartile range, IQR = Q3-Q1.
3 Estimate the lower bound, lower_bound= Q1 - 1.5 x IQR
4 Estimate the upper bound, upper_bound= Q3 + 1.5 x IQR
5 Exclude all data points that lie outside of the lower and the upper bound into a separate df.

In [18]:
# popularity
Q1 = df_movies.popularity.quantile(0.25)
Q3 = df_movies.popularity.quantile(0.75)
IQR = Q3-Q1 
lower_bound = Q1 - 1.5*IQR  
upper_bound = Q3 + 1.5*IQR
df_outliers_popularity = df_movies[(df_movies['popularity'] > upper_bound) | (df_movies['popularity'] < lower_bound)]
df_outliers_popularity.shape
Out[18]:
(845, 10)
In [19]:
# budget_adj
Q1 = df_movies.budget_adj.quantile(0.25)
Q3 = df_movies.budget_adj.quantile(0.75)
IQR = Q3-Q1 
lower_bound = Q1 - 1.5*IQR  
upper_bound = Q3 + 1.5*IQR
df_outliers_budget_adj = df_movies[(df_movies['budget_adj'] > upper_bound) | (df_movies['budget_adj'] < lower_bound)]
df_outliers_budget_adj.shape
Out[19]:
(1000, 10)
In [20]:
# revenue_adj
Q1 = df_movies.revenue_adj.quantile(0.25)
Q3 = df_movies.revenue_adj.quantile(0.75)
IQR = Q3-Q1 
lower_bound = Q1 - 1.5*IQR  
upper_bound = Q3 + 1.5*IQR
df_outliers_revenue_adj = df_movies[(df_movies['revenue_adj'] > upper_bound) | (df_movies['revenue_adj'] < lower_bound)]
df_outliers_revenue_adj.shape
Out[20]:
(1402, 10)
In [21]:
df_movies.shape
Out[21]:
(9773, 10)
In [22]:
df_outliers = pd.concat([df_outliers_popularity, df_outliers_budget_adj, df_outliers_revenue_adj])
df_outliers['id']=df_outliers.index
df_outliers.drop_duplicates('id',inplace=True)
df_outliers.shape
Out[22]:
(1914, 11)
In [23]:
df_movies.drop(df_outliers.index,inplace= True)
df_movies.shape
Out[23]:
(7859, 10)

Removing duplicates

In [24]:
df_movies.duplicated().sum()
Out[24]:
1
In [25]:
df_movies.drop_duplicates(inplace=True)
In [26]:
df_movies.shape
Out[26]:
(7858, 10)

Data Reformatting

Here we are gonna check and fix data types issues as well as reformat our data structure

In [27]:
df_movies.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7858 entries, 188222 to 22293
Data columns (total 10 columns):
popularity              7858 non-null float64
original_title          7858 non-null object
cast                    7858 non-null object
director                7858 non-null object
genres                  7858 non-null object
production_companies    7858 non-null object
vote_average            7858 non-null float64
release_year            7858 non-null int64
budget_adj              7858 non-null float64
revenue_adj             7858 non-null float64
dtypes: float64(4), int64(1), object(5)
memory usage: 675.3+ KB
In [28]:
df_movies.head(3)
Out[28]:
popularity original_title cast director genres production_companies vote_average release_year budget_adj revenue_adj
id
188222 1.532997 Entourage Jeremy Piven|Adrian Grenier|Jerry Ferrara|Kevi... Doug Ellin Comedy Warner Bros.|Closest to the Hole Productions 6.2 2015 2.759999e+07 4.532231e+07
252838 1.510096 The Wedding Ringer Kevin Hart|Josh Gad|Kaley Cuoco|Affion Crocket... Jeremy Garelick Comedy Miramax Films|Screen Gems|LStar Capital 6.5 2015 2.115999e+07 7.341586e+07
273477 1.499614 Scouts Guide to the Zombie Apocalypse Tye Sheridan|Logan Miller|Joey Morgan|Sarah Du... Christopher B. Landon Comedy|Horror Paramount Pictures|Oops Doughnuts Productions|... 6.2 2015 1.379999e+07 1.367190e+07
In [29]:
# changing release_year to int
df_movies['release_year'] = pd.to_numeric(df_movies['release_year'] , downcast='integer')
df_movies.info()
df_movies.shape
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7858 entries, 188222 to 22293
Data columns (total 10 columns):
popularity              7858 non-null float64
original_title          7858 non-null object
cast                    7858 non-null object
director                7858 non-null object
genres                  7858 non-null object
production_companies    7858 non-null object
vote_average            7858 non-null float64
release_year            7858 non-null int16
budget_adj              7858 non-null float64
revenue_adj             7858 non-null float64
dtypes: float64(4), int16(1), object(5)
memory usage: 629.3+ KB
Out[29]:
(7858, 10)
In [30]:
# solving problem with genres column
df_movies['genres_adj']= df_movies['genres'].str.split("|")
# getting unique values of genres column for explatory section
genres = df_movies.genres_adj.values
genres = np.hstack(genres)
genres_unique , genres_values_count = np.unique(genres,return_counts=True)
genres_dict = dict(zip(genres_unique, genres_values_count))
In [31]:
# dropping genres column
df_movies.drop(['genres'],axis=1,inplace=True)
In [32]:
df_movies.head(0)
Out[32]:
popularity original_title cast director production_companies vote_average release_year budget_adj revenue_adj genres_adj
id
In [33]:
# solving problem with director column
df_movies['director_adj']= df_movies['director'].str.split("|")
# getting unique values of director column for explatory section
directors = df_movies.director_adj.values
directors = np.hstack(directors)
directors_unique , directors_values_count = np.unique(directors,return_counts=True)
directors_dict = dict(zip(directors_unique, directors_values_count))
In [34]:
# dropping director column
df_movies.drop(['director'],axis=1,inplace=True)
In [35]:
df_movies.head(0)
Out[35]:
popularity original_title cast production_companies vote_average release_year budget_adj revenue_adj genres_adj director_adj
id
In [36]:
# solving problem with cast column
df_movies['cast_adj']= df_movies['cast'].str.split("|")
# getting unique values of cast column for explatory section
cast = df_movies.cast_adj.values
cast = np.hstack(cast)
cast_unique , cast_values_count = np.unique(cast,return_counts=True)
cast_dict = dict(zip(cast_unique, cast_values_count))
In [37]:
# dropping cast column
df_movies.drop(['cast'],axis=1,inplace=True)
In [38]:
df_movies.head(0)
Out[38]:
popularity original_title production_companies vote_average release_year budget_adj revenue_adj genres_adj director_adj cast_adj
id
In [39]:
# solving problem with production_companies column
df_movies['production_companies_adj']= df_movies['production_companies'].str.split("|")
# getting unique values of production_companies for explatory section
production_companies = df_movies.production_companies_adj.values
production_companies = np.hstack(production_companies)
production_companies_unique , production_companies_values_count = np.unique(production_companies,return_counts=True)
production_companies_dict = dict(zip(production_companies_unique, production_companies_values_count))
In [40]:
# dropping production_companies column
df_movies.drop(['production_companies'],axis=1,inplace=True)
In [41]:
df_movies.head(0)
Out[41]:
popularity original_title vote_average release_year budget_adj revenue_adj genres_adj director_adj cast_adj production_companies_adj
id

Exploratory Data Analysis

Analyzing outlier values

popularity
budget_adj
revenue_adj
vote_count

In [42]:
print(df_outliers.hist(figsize=(10,10)))
[[<matplotlib.axes._subplots.AxesSubplot object at 0x7f0f537fb208>
  <matplotlib.axes._subplots.AxesSubplot object at 0x7f0f51757240>]
 [<matplotlib.axes._subplots.AxesSubplot object at 0x7f0f51708d30>
  <matplotlib.axes._subplots.AxesSubplot object at 0x7f0f517399b0>]
 [<matplotlib.axes._subplots.AxesSubplot object at 0x7f0f516f2ac8>
  <matplotlib.axes._subplots.AxesSubplot object at 0x7f0f516f2b00>]]
  • budget_adj,popularity,revenue_adj,runtime and vote_count are all skewed to the left
  • release_year is positively skewed
  • vote_average is normally distributed
In [43]:
fig = px.scatter(df_outliers, x="budget_adj" ,y="revenue_adj")
fig.show()

There is no coorelation between budget and revenue in these outliers; although there is a positive coorelation between vote_count and popularity

Conclusion
Most of these outliers have very low popularity,revenue and buget but recent release_years

Analyzing null values

In [44]:
print(df_null.hist(figsize=(10,10)))
[[<matplotlib.axes._subplots.AxesSubplot object at 0x7f0f5b643860>
  <matplotlib.axes._subplots.AxesSubplot object at 0x7f0f514d36a0>]
 [<matplotlib.axes._subplots.AxesSubplot object at 0x7f0f51484c18>
  <matplotlib.axes._subplots.AxesSubplot object at 0x7f0f51444828>]
 [<matplotlib.axes._subplots.AxesSubplot object at 0x7f0f514023c8>
  <matplotlib.axes._subplots.AxesSubplot object at 0x7f0f51402400>]]

Rows having null values have similar distribution to outliers but even have lowest popularity,revenue,buget and vote count but more recent release_years than null values

Asking the right questions:

Questions:

1 what attributes affect revenue most?
2 what attributes affect profit most?
3 How did people preference's of film's genres changed over the last 10 years? 4 what was the most profitable movie of the last 10 year??
5 what is the highest grossing movie of all time?
6 what production companies produces most films?
7 what cast members are most popular?
8 who are the directors that direct most films?

1 What attributes affect revenue most?

In [45]:
df_movies.head(0)
Out[45]:
popularity original_title vote_average release_year budget_adj revenue_adj genres_adj director_adj cast_adj production_companies_adj
id
In [46]:
fig = px.scatter(df_movies, x="vote_average" ,y="revenue_adj")
fig.show()
In [47]:
fig = px.scatter(df_movies, x="release_year" ,y="revenue_adj")
fig.show()
In [48]:
fig = px.scatter(df_movies, x="budget_adj" ,y="revenue_adj")
fig.show()

No coorelation

In [49]:
df_movies.reset_index(drop=True, inplace=True)
In [50]:
Total_revenue = []
for genre in genres_unique:
    def calc_total_revenue(genres_adj,revenue_adj):
        if genre in genres_adj:
            return revenue_adj
    revenue = df_movies.apply(lambda row: calc_total_revenue(row['genres_adj'],row['revenue_adj']), axis=1)
    Total_revenue.append(revenue.sum())
In [51]:
plt.figure(figsize=(25,10))
plt.bar(genres_unique, Total_revenue)
plt.title('Total Revenue gained by each genre')
plt.xlabel('Movie Genres')
plt.ylabel('Total Revenue')
Out[51]:
Text(0,0.5,'Total Revenue')

TV movies have lowest revenue while Drama movies have highest revenue

2 What attributes affect profit most?

In [52]:
df_movies['profit'] = df_movies.revenue_adj - df_movies.budget_adj
In [53]:
fig = px.scatter(df_movies, x="vote_average" ,y="profit")
fig.show()

There is a slightly positive coorelation between vote average and profit

In [54]:
fig = px.scatter(df_movies, x="release_year" ,y="profit")
fig.show()

no coorelation between release year and profit

In [55]:
fig = px.scatter(df_movies, x="budget_adj" ,y="profit")
fig.show()

There is a negative coorelation between budget and profit

In [56]:
Total_profit = []
for genre in genres_unique:
    def calc_total_profit(genres_adj,profit):
        if genre in genres_adj:
            return profit
    profit = df_movies.apply(lambda row: calc_total_profit(row['genres_adj'],row['profit']), axis=1)
    Total_profit.append(profit.sum())
In [57]:
plt.figure(figsize=(25,10))
plt.bar(genres_unique, Total_profit)
plt.title('Total Profit gained by each genre')
plt.xlabel('Movie Genres')
plt.ylabel('Total profit')
Out[57]:
Text(0,0.5,'Total profit')

TV movies and western movies have lowest profit while comedy movies have highest profit

3 How did people preference's of film's genres changed over the last 10 years?

In [58]:
years = np.arange(2006,2016)
zeros = np.zeros((len(genres_unique),len(years)), dtype=int)
temp_dict_y = dict(zip(genres_unique,zeros))

def count_genres(genres_list,year):
    for g in genres_list:
        temp_dict_y[g][year%2006]=temp_dict_y[g][year%2006]+1
                              
for year in years:
    df_temp = df_movies[df_movies['release_year']==year]
    df_temp.apply(lambda row:count_genres(row['genres_adj'],year), axis=1)
    
d = []
for genre in genres_unique:
    d.append(go.Bar(name=genre, x=years, y=temp_dict_y[genre]))
    
fig = go.Figure(data=d)
fig.update_layout(barmode='group')
fig.show()

Drama has been the most common type of movies from 2006 to 2015 while western genre has been the least common types of movies in that period

In [59]:
x = years
fig = go.Figure()

for genre in temp_dict_y.keys():
    fig.add_trace(go.Scatter(
        x=x,
        y=temp_dict_y[genre],
        name = genre, # Style name/legend entry with html tags
        connectgaps=True # override default to connect the gaps
    ))

fig.show()

from 2014 thriller movies beacame more common than comedy

4 what was the most profitable movie of the last 10 year?

In [60]:
profit = []
Years_adj = []
for year in years:
    net = df_movies[df_movies['release_year']==year].profit.max()
    name = df_movies[df_movies['profit']==net].original_title.values[0]
    Years_adj.append(str(name)+" ("+str(year)+")")
    profit.append(net)
In [61]:
plt.figure(figsize=(45,25))
plt.bar(Years_adj, profit)
plt.title('Total Profit gained by each movie in the last 10 years')
plt.xlabel('Movie name (year)')
plt.ylabel('profit')
Out[61]:
Text(0,0.5,'profit')

5 what is the highest grossing movie of all time?

In [62]:
max_revenue = df_movies.revenue_adj.max()
movie = df_movies[df_movies.revenue_adj == max_revenue]
movie.original_title.values[0]
Out[62]:
'The Black Hole'

6 what production companies produces most films?

In [63]:
i = production_companies_values_count.tolist().index(production_companies_values_count.max())
production_companies_unique[i]
Out[63]:
'Universal Pictures'
In [64]:
counts = []
names=[]

for name in production_companies_unique:
    count = production_companies_dict[name]
    if count > 100:
        counts.append(count)
        names.append(name)

fig = px.pie(values=counts, names=names, title='percentage of movies produced by each production company')
fig.show()
In [65]:
plt.figure(figsize=(35,15))
plt.bar(names, counts)
plt.title('Total number of films produced by each production company')
plt.xlabel('production company')
plt.ylabel('Number of films')
Out[65]:
Text(0,0.5,'Number of films')

7 what cast members are most popular?

In [66]:
i = cast_values_count.tolist().index(cast_values_count.max())
cast_unique[i]
Out[66]:
'Samuel L. Jackson'
In [67]:
counts = []
names=[]

for name in cast_unique:
    count = cast_dict[name]
    if count > 35:
        counts.append(count)
        names.append(name)
        
fig = px.pie(values=counts, names=names, title='percentage of movies produced by each production company')
fig.show()
In [68]:
plt.figure(figsize=(35,15))
plt.bar(names, counts)
plt.title('Total number of films acted by top cast')
plt.xlabel('cast name')
plt.ylabel('Number of films')
Out[68]:
Text(0,0.5,'Number of films')

8 who are the directors that direct most films?

In [69]:
i = directors_values_count.tolist().index(directors_values_count.max())
directors_unique[i]
Out[69]:
'Woody Allen'
In [70]:
counts = []
names=[]
for name in directors_unique:
    count = directors_dict[name]
    if count > 15:
        counts.append(count)
        names.append(name)
        
fig = px.pie(values=counts, names=names, title='percentage of movies produced by each production company')
fig.show()
In [71]:
plt.figure(figsize=(35,15))
plt.bar(names, counts)
plt.title('Total number of films directed by directors')
plt.xlabel('director name')
plt.ylabel('Number of films')
Out[71]:
Text(0,0.5,'Number of films')

Conclusions

  • movies with lowest budget have lower revenues
  • TV movies have lowest revenue while Drama movies have highest revenue
  • There is a slightly positive coorelation between vote average and profit
  • There is a negative coorelation between budget and profit
  • TV movies and western movies have lowest profit while comedy movies have highest profit
  • Drama has been the most common type of movies from 2006 to 2015 while western genre has been the least common types of movies in that period
  • from 2014 thriller movies beacame more common than comedy
  • The Black Hole highest grossing movie of all time
  • Universal Pictures produces most films
  • Samuel L. Jackson acted in most films
  • Woody Allen directed most films

Limitations

Data contains lot of outliers, many rows contain missing values or zeros while other rows have very large numbers